MODEL ADAPTATION APPARATUS, MODEL ADAPTATION METHOD, STORAGE
MEDIUM, AND PATTERN RECOGNITION APPARATUS 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a model adaptation 
apparatus, a model adaptation method, a storage medium, and 
a pattern recognition apparatus, and more particularly, to a 
model adaptation apparatus, a model adaptation method, a 
storage medium, and a pattern recognition apparatus, which 
are suitable for use in speech recognition or the like. 

2. Description of the Related Art 

Methods of recognizing a word or the like uttered in a 
noisy environment are known. Representative examples 
thereof include a PMC (Parallel Model Combination) method, an
SS/NSS (Spectral Subtraction/Nonlinear Spectral Subtraction)
method, and an SFE (Stochastic Feature Extraction) method.

The advantage of the PMC method is that information of 
ambient noise is directly incorporated in an acoustic model 
and thus high recognition performance can be achieved. 
However, its disadvantage is a high calculation cost: the
complicated calculations of the PMC method require a
large-scale apparatus and a long processing time.
On the other hand, in the SS/NSS method, ambient noise is
removed when a feature value of voice data is extracted.
Therefore, the SS/NSS method needs lower calculation cost 
than is needed in the PMC method and thus this method is now 
widely used in the art. In the SFE method, although ambient 
noise is removed when a feature value of voice data is 
extracted, as in the SS/NSS method, the extracted feature 
value is represented by a probability distribution. Thus,
the SFE method differs from the SS/NSS method and the PMC
method in that the SFE method extracts the feature value of
voice in the form of a distribution in the feature space,
whereas the SS/NSS method and the PMC method extract the
feature value of voice in the form of a point in the feature
space.

In any of the methods described above, after the feature
value of the voice is extracted, it is determined which one
of the acoustic models corresponding to registered words or
the like best matches the feature value, and the word
corresponding to the best matching acoustic model is
employed and output as a recognition result.

A detailed description of the SFE method may be found, 
for example, in Japanese Unexamined Patent Application 
Publication No. 11-133992 (Japanese Patent Application No. 
9-300979) which has been filed by the applicant for the 
present invention. Discussions on the performance of the 
PMC method, the SS/NSS method, and the SFE method may be
found, for example, in the following papers: H. Pao, H.
Honda, K. Minamino, M. Omote, H. Ogawa and N. Iwahashi,
"Stochastic Feature Extraction for Improving Noise
Robustness in Speech Recognition", Proceedings of the 8th
Sony Research Forum, SRF98-234, pp. 9-14, October 1998; N.
Iwahashi, H. Pao, H. Honda, K. Minamino and M. Omote,
"Stochastic Features for Noise Robust Speech Recognition",
ICASSP'98 Proceedings, pp. 633-636, May 1998; N. Iwahashi, H.
Pao (presented), H. Honda, K. Minamino and M. Omote, "Noise
Robust Speech Recognition Using Stochastic Representation of
Features", ASJ'98 Spring Proceedings, pp. 91-92, March
1998; N. Iwahashi, H. Pao, H. Honda, K. Minamino and M.
Omote, "Stochastic Representation of Feature for Noise
Robust Speech Recognition", Technical Report of IEICE,
pp. 19-24, SP97-97 (1998-01).

A problem with the above-described SFE method or 
similar methods is that degradation in recognition 
performance can occur because ambient noise is not directly 
reflected in speech recognition, that is, because 
information of ambient noise is not directly incorporated in 
an acoustic model. 

Furthermore, because information of ambient noise is 
not directly incorporated in the acoustic model, degradation 
in the recognition performance becomes more serious as the 
time period from the start of speech recognition operation
to the start of utterance becomes longer.

SUMMARY OF THE INVENTION 

In view of the above, it is an object of the present 
invention to provide a technique in which an acoustic model 
is corrected using information of ambient noise thereby 
preventing the recognition performance from being degraded 
as the time period from the start of speech recognition 
operation to the start of utterance becomes longer. 

According to an aspect of the present invention, there 
is provided a model adaptation apparatus comprising data 
extraction means for extracting input data corresponding to 
a predetermined model, observed during a predetermined 
interval, and then outputting the extracted data; and a 
model adaptation means for adapting the predetermined model 
using the data extracted during the predetermined interval 
by means of one of the most likelihood method, the complex 
statistic method, and the minimum distance-maximum 
separation theorem. 

According to another aspect of the present invention, 
there is provided a model adaptation method comprising the 
steps of extracting input data corresponding to a 
predetermined model, observed during a predetermined 
interval, and then outputting the extracted data; and 
adapting the predetermined model using the data extracted
during the predetermined interval by means of one of the
most likelihood method, the complex statistic method, and 
the minimum distance-maximum separation theorem. 

According to still another aspect of the present 
invention, there is provided a storage medium which stores a 
program comprising the steps of extracting input data 
corresponding to a predetermined model, observed during a 
predetermined interval, and then outputting the extracted 
data; and adapting the predetermined model using the data 
extracted during the predetermined interval by means of one 
of the most likelihood method, the complex statistic method, 
and the minimum distance-maximum separation theorem. 

According to still another aspect of the present 
invention, there is provided a pattern recognition apparatus 
comprising: data extraction means for extracting input data 
corresponding to a predetermined model, observed during a 
predetermined interval, and then outputting the extracted 
data; and a model adaptation means for adapting the 
predetermined model using the data extracted during the 
predetermined interval by means of one of the most 
likelihood method, the complex statistic method, and the 
minimum distance-maximum separation theorem. 

In the model adaptation apparatus, the model adaptation
method, the storage medium, and the pattern recognition
apparatus according to the present invention, as described
above, input data corresponding to a predetermined model
observed during a predetermined interval is extracted and
output as extracted data. The predetermined model is
adapted using the data extracted during the predetermined
interval by means of one of the most likelihood method, the
complex statistic method, and the minimum distance-maximum
separation theorem.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram illustrating an embodiment of 
a speech recognition apparatus according to the present 
invention; 

Fig. 2 is a diagram illustrating the operation of a 
noise observation interval extractor shown in Fig. 1;

Fig. 3 is a block diagram illustrating an example of a 
detailed construction of a feature extractor 5 shown in Fig. 
1; 

Fig. 4 is a block diagram illustrating an example of a 
detailed construction of a speech recognition unit 6 shown 
in Fig. 1; 

Fig. 5 is a diagram illustrating a hidden Markov model 
(HMM);

Fig. 6 is a diagram illustrating feature vectors y
obtained during a noise observation interval Tn and also
illustrating feature distributions $F_i(y)$;

Fig. 7 is a diagram illustrating a manner in which a
non-speech feature distribution PDF is mapped to a
probability distribution $F_s(y)$ corresponding to a non-speech
acoustic model;

Fig. 8 is a diagram illustrating a manner in which a 
non-speech acoustic model is adapted by means of the most 
likelihood method; 

Fig. 9 is a diagram illustrating feature vectors
obtained during a noise observation interval Tn and also
illustrating feature distributions $Y_t$ in the form of normal
distributions $N(\mu_t, \Sigma_t)$;

Fig. 10 is a flow chart illustrating a process of 
adapting a non-speech acoustic model by means of the most 
likelihood method; 

Fig. 11 is a diagram illustrating a manner in which a 
non-speech acoustic model is adapted by means of the complex 
statistic method; 

Fig. 12 is a flow chart illustrating a process of 
adapting a non-speech acoustic model by means of the complex 
statistic method; 

Fig. 13 is a diagram illustrating a manner in which a 
non-speech acoustic model is adapted by means of the minimum 
distance-maximum separation theorem; 

Fig. 14 is a flow chart illustrating a process of 
adapting a non-speech acoustic model by means of the minimum
distance-maximum separation theorem;

Fig. 15 is a block diagram illustrating an example of a 
construction of a non-speech acoustic model correction unit 
shown in Fig. 1; 

Fig. 16 is a diagram illustrating a manner in which 
discrete values are converted into a continuous value; 

Fig. 17 is a graph illustrating a freshness function 
F(x); and

Fig. 18 is a block diagram illustrating an embodiment 
of a computer according to the present invention. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Fig. 1 illustrates an embodiment of a speech 
recognition apparatus according to the present invention. 

In this speech recognition apparatus, a microphone 1
detects an uttered voice to be recognized together with
ambient noise and outputs the detected voice and ambient
noise to a conversion-into-frame unit 2. The
conversion-into-frame unit 2 converts the voice data received
from the microphone 1 into digital form. Furthermore, the
conversion-into-frame unit 2 extracts the digital voice data
in predetermined intervals (every 10 ms, for example) and
outputs the extracted data in the form of a frame of data.
The voice data output in units of frames from the
conversion-into-frame unit 2 is supplied, in the form of an
observation vector a including as components thereof a time
series of voice data of each frame, to a noise observation
interval extractor 3 and a feature extractor 5.

Hereinafter, the observation vector for the tth frame of
speech data is denoted as a(t).

The noise observation interval extractor 3 stores
frames of voice data supplied from the conversion-into-frame
unit 2 in a buffer for a predetermined period of time
(corresponding to 2N or more frames). Thereafter, as shown
in Fig. 2, a noise observation interval is set whose end
time $t_b$ is the time at which a speech switch 4 is turned on
and whose start time $t_0$ is 2N frames before the end time
$t_b$. The observation vectors a for the 2N frames in the noise
observation interval are extracted and output to the feature
extractor 5 and a non-speech acoustic model correction unit 7.
In the present embodiment, the noise observation interval is
divided into two sub-intervals: a noise observation interval
Tm during which a feature distribution which will be
described later is extracted, and a noise observation
interval Tn during which adaptation of the acoustic model is
performed. Each of the noise observation intervals Tm and
Tn has a length corresponding to N frames. However, it is
not necessarily required that the lengths of the noise
observation intervals Tm and Tn be equal to each other.
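
The buffering and splitting performed by the noise observation
interval extractor 3 can be sketched as follows. This is only a
minimal illustration under stated assumptions: the class name
`NoiseIntervalExtractor`, its method names, and the use of a
fixed-length queue are hypothetical, and both sub-intervals are
taken to be N frames long.

```python
from collections import deque


class NoiseIntervalExtractor:
    """Keeps the most recent 2N frames and splits them into Tm and Tn."""

    def __init__(self, n_frames):
        self.n = n_frames
        # Older frames fall out automatically once 2N frames are stored.
        self.buffer = deque(maxlen=2 * n_frames)

    def push(self, frame):
        """Store one frame received from the framing unit."""
        self.buffer.append(frame)

    def extract(self):
        """Called when the speech switch is turned on (time t_b).

        Returns (Tm, Tn): the first N and the last N of the 2N frames
        that immediately precede t_b.
        """
        if len(self.buffer) < 2 * self.n:
            raise ValueError("fewer than 2N frames have been buffered")
        frames = list(self.buffer)
        return frames[:self.n], frames[self.n:]
```

For example, after ten frames numbered 0 to 9 have been pushed with
N = 3, `extract()` returns frames 4 to 6 as Tm and frames 7 to 9 as Tn.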

The speech switch 4 is turned on by a user when the
user starts speech and is turned off when the speech is
ended. Therefore, as can be seen from Fig. 2, before the
speech switch 4 is turned on at a time $t_b$, the voice data (in
the noise observation interval) does not include uttered
speech but includes only ambient noise. An interval from
the time $t_b$ at which the speech switch 4 is turned on to a
time $t_d$ at which the speech switch 4 is turned off is
employed as a speech recognition interval during which voice
data is subjected to speech recognition.

On the basis of the voice data which is supplied from
the noise observation interval extractor 3 and which
includes only ambient noise obtained during the noise
observation interval Tm, which is the first of the
two noise observation intervals Tm and Tn, the feature
extractor 5 removes the ambient noise from the observation
vector a which is supplied from the conversion-into-frame
unit 2 during the speech recognition interval starting at $t_b$.

The feature extractor 5 determines the power spectrum
of the real voice data (obtained by removing the ambient
noise) in the form of the observation vector a by means of,
for example, a Fourier transform. The feature extractor 5
then calculates a feature vector y including, as its
components, frequency components of the power spectrum. The
calculation method of the power spectrum is not limited to
those based on the Fourier transform; the power spectrum
may be determined by another method such as a filter bank
method.

Thereafter, on the basis of the feature vector y and 
the ambient noise during the noise observation interval Tm, 
the feature extractor 5 calculates a parameter Z indicating 
the distribution, in the space of feature values (feature 
space), of a feature value which is obtained by mapping the 
real voice included in the voice data in the form of the 
observation vector a (hereinafter, such a parameter will be 
referred to as a feature distribution parameter). The 
resultant feature distribution parameter Z is supplied to a 
speech recognition unit 6.

Fig. 3 illustrates an example of a detailed
construction of the feature extractor 5 shown in Fig. 1.
The observation vector a input to the feature extractor 5
from the conversion-into-frame unit 2 is applied to a power
spectrum analyzer 11. In the power spectrum analyzer 11,
the observation vector a is subjected to a Fourier transform
based on, for example, an FFT (fast Fourier transform)
algorithm, thereby extracting a feature vector in the form of
a power spectrum of a voice. Herein, it is assumed that an
observation vector a in the form of one frame of voice data
is converted into a feature vector consisting of M
components (M-dimensional feature vector).
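
The conversion of one frame into an M-component power spectrum can
be sketched as follows. This is a minimal illustration, not the
analyzer 11 itself: for brevity it uses a naive discrete Fourier
transform in place of an FFT (both produce the same spectrum), and
the function name `power_spectrum` and the choice of keeping the
first M frequency bins are assumptions of the example.

```python
import cmath
import math


def power_spectrum(frame, m):
    """Return the first m power-spectrum components of one frame.

    frame: list of time-domain samples (the observation vector a(t)).
    m:     number of frequency components to keep (at most len(frame)).
    """
    n = len(frame)
    spectrum = []
    for k in range(m):
        # k-th DFT coefficient computed directly from its definition.
        coeff = sum(
            sample * cmath.exp(-2j * math.pi * k * idx / n)
            for idx, sample in enumerate(frame)
        )
        # Power is the squared magnitude of the coefficient.
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

In practice an FFT routine would replace the inner sum; the naive
form is used here only to keep the example self-contained.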

Herein, a feature vector obtained from the observation
vector a(t) of the tth frame is denoted by y(t). Furthermore,
the spectrum component of real voice in the feature vector
y(t) is denoted by x(t) and the spectrum component of ambient
noise is denoted by u(t). Thus, the spectrum component of
real voice, x(t), is given by the following equation (1).

x(t) = y(t) - u(t) (1) 

Herein, it is assumed that the characteristic of the 
ambient noise can vary irregularly and it is also assumed 
that the voice data in the form of an observation vector 
a(t) consists of a real voice component plus ambient noise. 

The ambient noise which is input as voice data to the 
feature extractor 5 from the noise observation interval 
extractor 3 during the noise observation interval Tm is 
applied to a noise characteristic calculation unit 13. The 
noise characteristic calculation unit 13 determines the 
characteristic of the ambient noise during the noise 
observation interval Tm. 

Assuming that the distribution of the power spectrum
u(t) of the ambient noise during the speech recognition
interval is the same as (or similar to) that of the ambient
noise during the noise observation interval Tm immediately
before the speech recognition interval, and further assuming
that the distribution is a normal distribution, the noise
characteristic calculation unit 13 calculates the mean value
(mean vector) and the variance (variance matrix, that is,
co-variance matrix) of the ambient noise such that the normal
distribution is represented by the mean value and the
variance.

The mean vector $\mu'$ and the variance matrix $\Sigma'$ can be
given by the following equation (2).

$$
\begin{aligned}
\mu'(i) &= \frac{1}{N} \sum_{t=1}^{N} y(t)(i) \\
\Sigma'(i, j) &= \frac{1}{N} \sum_{t=1}^{N} \bigl(y(t)(i) - \mu'(i)\bigr)\bigl(y(t)(j) - \mu'(j)\bigr)
\end{aligned}
\tag{2}
$$

where $\mu'(i)$ denotes an ith component of the mean vector $\mu'$
(i = 1, 2, ..., M), y(t)(i) denotes an ith component of a
tth frame of feature vector, and $\Sigma'(i, j)$ denotes a
component in an ith row and a jth column of the variance
matrix $\Sigma'$ (j = 1, 2, ..., M).

Herein, for simplicity in calculation, the respective
components of the feature vector y of the ambient noise are
assumed to have no correlation with each other. In this
case, the components of the variance matrix $\Sigma'$ become 0
except for diagonal components, as shown below.

$$
\Sigma'(i, j) = 0, \quad i \neq j \tag{3}
$$

The noise characteristic calculation unit 13 determines,
in the above-described manner, the mean vector $\mu'$ and the
variance matrix $\Sigma'$ which define a normal distribution
representing the ambient noise characteristic, and supplies
the result to a feature distribution parameter calculation
unit 12.
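
The calculation of the noise characteristic under the diagonal
assumption of equation (3) can be sketched as follows. This is a
minimal sketch: the function name `noise_statistics` and the
representation of the frames as a list of equal-length lists are
assumptions of the example, and only the diagonal components of the
variance matrix are computed.

```python
def noise_statistics(frames):
    """Per-component mean and variance of noise feature vectors.

    frames: list of N feature vectors y(t), each a list of M values,
            observed during the noise observation interval Tm.
    Returns (mean, var), where mean[i] is the ith component of the
    mean vector and var[i] is the (i, i) component of the variance
    matrix of equation (2).
    """
    n = len(frames)
    m = len(frames[0])
    mean = [sum(f[i] for f in frames) / n for i in range(m)]
    # Divide by N (not N - 1), following equation (2).
    var = [sum((f[i] - mean[i]) ** 2 for f in frames) / n for i in range(m)]
    return mean, var
```

For the two frames [1, 2] and [3, 6], the function returns the mean
[2, 4] and the variances [1, 4].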

The feature vector y of the uttered voice containing 
ambient noise, output from the power spectrum analyzer 11, 
is also supplied to the feature distribution parameter 
calculation unit 12. In the feature distribution parameter 
calculation unit 12, feature distribution parameters 
representing the distribution (estimated distribution) of 
the power spectrum of the real voice are calculated from the 
feature vector y supplied from the power spectrum analyzer 
11 and the ambient noise characteristic supplied from the 
noise characteristic calculation unit 13. 

That is, in the feature distribution parameter
calculation unit 12, assuming that the power spectrum of the
real voice has a normal distribution, the mean vector $\xi$ and
the variance matrix $\Psi$ thereof are determined as the feature
distribution parameters in accordance with equations (4) to
(7) shown below.

$$
\begin{aligned}
\xi(t)(i) &= E[x(t)(i)] \\
&= E[y(t)(i) - u(t)(i)] \\
&= \frac{y(t)(i) \int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i) - \int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)} \\
&= y(t)(i) - \frac{\int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
\Psi(t)(i, j) &= V[x(t)(i)] \\
&= E[(x(t)(i))^2] - (E[x(t)(i)])^2 \\
&= E[(x(t)(i))^2] - (\xi(t)(i))^2 && \text{for } i = j \\
\Psi(t)(i, j) &= 0 && \text{for } i \neq j
\end{aligned}
\tag{5}
$$

$$
\begin{aligned}
E[(x(t)(i))^2] &= E[(y(t)(i) - u(t)(i))^2] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)^2 \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}\,du(t)(i) \\
&= \frac{1}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)} \Bigl\{ (y(t)(i))^2 \int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i) \\
&\qquad - 2\,y(t)(i) \int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)
 + \int_0^{y(t)(i)} (u(t)(i))^2\,P(u(t)(i))\,du(t)(i) \Bigr\} \\
&= (y(t)(i))^2 - 2\,y(t)(i)\,\frac{\int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
 + \frac{\int_0^{y(t)(i)} (u(t)(i))^2\,P(u(t)(i))\,du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
\end{aligned}
\tag{6}
$$

$$
P(u(t)(i)) = \frac{1}{\sqrt{2\pi\,\Sigma'(i, i)}} \exp\left(-\frac{\bigl(u(t)(i) - \mu'(i)\bigr)^2}{2\,\Sigma'(i, i)}\right) \tag{7}
$$

In the above equations, $\xi(t)(i)$ denotes the ith
component of the mean vector $\xi(t)$ in the tth frame, E[ ]
denotes the mean value of the values enclosed in square
brackets, x(t)(i) denotes the ith component of the power
spectrum x(t) of the real voice in the tth frame, u(t)(i)
denotes the ith component of the power spectrum of the
ambient noise in the tth frame, and P(u(t)(i)) denotes the
probability that the ith component of the power spectrum of
the ambient noise in the tth frame is u(t)(i). Because the
ambient noise is assumed to have a normal distribution,
P(u(t)(i)) is given by equation (7) described above.

$\Psi(t)(i, j)$ denotes a component in the ith row and the
jth column of the variance matrix $\Psi(t)$ in the tth frame.
V[ ] denotes the variance of the values enclosed in square
brackets.

As described above, the feature distribution parameter
calculation unit 12 determines, for each frame, the feature
distribution parameters including the mean vector $\xi$ and the
variance matrix $\Psi$ so as to represent the distribution of the
real voice in the feature vector space (assuming that the
distribution of the real voice in the feature vector space
can be represented by a normal distribution).

Thereafter, the feature distribution parameter
determined for each frame during the speech recognition
interval is output to the speech recognition unit 6. For
example, when a speech recognition interval includes T
frames and feature distribution parameters for the
respective T frames are given by $z(t) = \{\xi(t), \Psi(t)\}$
(t = 1, 2, ..., T), the feature distribution parameter (in the
form of a series) $Z = \{z(1), z(2), \ldots, z(T)\}$ is supplied from
the feature distribution parameter calculation unit 12 to
the speech recognition unit 6.
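
The per-component mean and variance of equations (4) to (6) can be
evaluated numerically, as in the following minimal sketch. It is an
illustration under stated assumptions rather than the method of the
calculation unit 12: the integrals over the range from 0 to y are
approximated with a composite trapezoidal rule, the noise density is
the normal density of mean `noise_mean` and variance `noise_var`,
and all function names are hypothetical. The observed value y is
assumed to be positive and not so far below the noise mean that the
normalizing integral underflows to zero.

```python
import math


def normal_pdf(u, mean, var):
    """Normal density used for the ambient-noise component."""
    return math.exp(-(u - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def integrate(f, lo, hi, steps=2000):
    """Composite trapezoidal approximation of the integral of f on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def feature_distribution(y, noise_mean, noise_var):
    """Return (xi, psi) for one spectral component of one frame.

    xi  : mean of the real-voice component, as in equation (4)
    psi : variance of the real-voice component, as in equations (5), (6)
    """
    p = lambda u: normal_pdf(u, noise_mean, noise_var)
    norm = integrate(p, 0.0, y)                           # integral of P(u)
    m1 = integrate(lambda u: u * p(u), 0.0, y) / norm     # E[u] on [0, y]
    m2 = integrate(lambda u: u * u * p(u), 0.0, y) / norm  # E[u^2] on [0, y]
    xi = y - m1
    second_moment = y * y - 2.0 * y * m1 + m2              # E[x^2]
    psi = second_moment - xi * xi
    return xi, psi
```

When y is much larger than the noise level, the truncation to the
range from 0 to y has almost no effect, so xi approaches y minus the
noise mean and psi approaches the noise variance.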

Referring again to Fig. 1, the speech recognition unit
6 classifies the feature distribution parameter Z received
from the feature extractor 5 into one of a predetermined
number (K) of acoustic models or one non-speech acoustic
model (acoustic model representing a state in which no voice
is present but only ambient noise is present), and the
resultant model is output as a recognition result of the
input voice. More specifically, the speech recognition unit
6 stores an identification function corresponding to a
non-speech interval (that is, a function indicating whether a
given feature distribution parameter Z should be classified
into the non-speech acoustic model) and identification
functions respectively corresponding to a predetermined
number (K) of words (that is, functions indicating which
acoustic model a feature distribution parameter Z should be
classified into). The speech
recognition unit 6 calculates the values of the
identification functions corresponding to the respective
acoustic models by employing, as the argument, the feature
distribution parameter Z supplied from the feature extractor
5. The speech recognition unit 6 selects an acoustic model
(a word or non-speech (noise)) having the greatest function
value (that is, the greatest score) and outputs the selected
acoustic model as the recognition result.

Fig. 4 illustrates an example of a detailed
construction of the speech recognition unit 6 shown in Fig.
1. The feature distribution parameter Z input from the
feature distribution parameter calculation unit 12 of the
feature extractor 5 is supplied to identification function
calculation units 21-1 to 21-K and also to an identification
function calculation unit 21-s. Each identification
function calculation unit 21-k (k = 1, 2, ..., K) stores an
identification function $G_k(Z)$ for discriminating a word
corresponding to a kth acoustic model of the K acoustic
models, and calculates the identification function $G_k(Z)$ by
employing, as an argument, the feature distribution parameter
Z supplied from the feature extractor 5. The identification
function calculation unit 21-s stores an identification
function $G_s(Z)$ for discriminating a non-speech interval
corresponding to the non-speech acoustic model and calculates
the identification function $G_s(Z)$ by employing, as an
argument, the feature distribution parameter Z supplied from
the feature extractor 5.

The speech recognition unit 6 discriminates
(recognizes) a class indicating a word or a non-speech state,
using, for example, an HMM (Hidden Markov Model) method.

The HMM method is described below with reference to Fig.
5. In Fig. 5, the HMM includes H states $q_1$ to $q_H$, wherein
state transition is allowed only from one state to that
state itself or to the state immediately to its right.
The leftmost state $q_1$ is an initial state, and the
rightmost state $q_H$ is an end state. Transition from the end
state $q_H$ is not allowed. The model in which state transition
to the left is forbidden is called a left-to-right model.
In general, a left-to-right model is used in speech
recognition.

Herein, a model for discriminating k classes of an HMM
is referred to as a k-class model. A k-class model can be
defined by a probability (initial state probability) $\pi_k(q_h)$
of being initially present in a state $q_h$, a probability
(transition probability) $a_k(q_i, q_j)$ of transition from a
state $q_i$ at a time (frame) t to a state $q_j$ at a time t + 1,
and a probability (output probability) $b_k(q_i)(O)$ for a state
$q_i$ to output a feature vector O when a transition from that
state $q_i$ occurs (wherein h = 1, 2, ..., H).

When a series of feature vectors $O_1$, $O_2$, ... is given, the
class of the model which gives the greatest probability
(observation probability) that such a series of feature
vectors is observed is employed as a recognition result of
the series of feature vectors.

Herein, the observation probability is determined by
the identification function $G_k(Z)$. The identification
function $G_k(Z)$ indicates the probability that a (series of)
feature distribution parameters $Z = \{z_1, z_2, \ldots, z_T\}$ is
observed in an optimum state series (optimum manner in which
state transitions occur) for such a (series of) feature
distribution parameters $Z = \{z_1, z_2, \ldots, z_T\}$, and is given
by the following equation (8).

$$
G_k(Z) = \max_{q_1, q_2, \ldots, q_T} \pi_k(q_1) \cdot b_k'(q_1)(z_1) \cdot a_k(q_1, q_2) \cdot b_k'(q_2)(z_2) \cdots a_k(q_{T-1}, q_T) \cdot b_k'(q_T)(z_T) \tag{8}
$$

where $b_k'(q_i)(z_j)$ denotes the output probability when the
output has a distribution represented by $z_j$. Herein, the
output probability $b_k(s)(O_t)$ of outputting a feature vector
when a state transition occurs is represented by a normal
distribution function on the assumption that there is no
correlation among components in the feature vector space.
In this case, when an input has a distribution represented
by $z_t$, the output probability $b_k'(s)(z_t)$ can be determined
using a probability density function $P_k^m(s)(x)$ defined by a
mean vector $\mu_k(s)$ and a variance matrix $\Sigma_k(s)$ and also using
a probability density function $P^f(t)(x)$ representing the
feature vector (power spectrum in this case) x of a tth
frame, in accordance with the following equation.

$$
\begin{aligned}
b_k'(s)(z_t) &= \int P^f(t)(x)\, P_k^m(s)(x)\, dx \\
&= \prod_{i=1}^{M} P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i, i)\bigr)
\end{aligned}
\tag{9}
$$

$$
k = 1, 2, \ldots, K; \quad s = q_1, q_2, \ldots, q_T; \quad t = 1, 2, \ldots, T
$$

In equation (9), the integration is performed over the
entire M-dimensional feature vector space (power spectrum
space in this case).

Furthermore, in equation (9), $P(s)(i)(\xi(t)(i), \Psi(t)(i, i))$
is given by the following equation.

$$
P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i, i)\bigr)
= \frac{1}{\sqrt{2\pi\bigl(\Sigma_k(s)(i, i) + \Psi(t)(i, i)\bigr)}}
\exp\left(-\frac{\bigl(\mu_k(s)(i) - \xi(t)(i)\bigr)^2}{2\bigl(\Sigma_k(s)(i, i) + \Psi(t)(i, i)\bigr)}\right)
\tag{10}
$$

where $\mu_k(s)(i)$ denotes an ith component of the mean vector
$\mu_k(s)$, and $\Sigma_k(s)(i, i)$ denotes a component in an ith row and
ith column of the variance matrix $\Sigma_k(s)$. Thus, the output
probability of a k-class model can be defined in the
above-described manner.
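
The output probability of equations (9) and (10) can be sketched as
follows. This is a minimal illustration under the diagonal-variance
assumption stated above; the function names `component_prob` and
`output_prob` and the representation of each vector as a plain list
are assumptions of the example.

```python
import math


def component_prob(mu_k, sigma_k, xi, psi):
    """One factor of equation (9), evaluated as in equation (10).

    mu_k, sigma_k: mean and variance of the model state for one component.
    xi, psi:       mean and variance of the feature distribution for
                   the same component of the current frame.
    """
    var = sigma_k + psi  # the two variances add
    return math.exp(-(mu_k - xi) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def output_prob(state_mean, state_var, xi, psi):
    """Product over the M components, as in equation (9)."""
    prob = 1.0
    for mu_k, sigma_k, x, p in zip(state_mean, state_var, xi, psi):
        prob *= component_prob(mu_k, sigma_k, x, p)
    return prob
```

With long sequences or many components the product can underflow, so
a practical implementation would normally accumulate logarithms
instead of multiplying probabilities.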

As described above, the HMM is defined by the initial
state probability $\pi_k(q_h)$, the transition probability
$a_k(q_i, q_j)$, and the output probability $b_k(q_i)(O)$, and these
probabilities are determined in advance from feature vectors
calculated from learning voice data.

In the case where the HMM shown in Fig. 5 is employed,
because transition starts from the leftmost state $q_1$, the
initial state probability for the state $q_1$ is set to 1, and
the initial state probabilities for the other states are set
to 0. As can be seen from equations (9) and (10), if
$\Psi(t)(i, i)$ is set to 0, then the output probability becomes
equal to that of a continuous HMM in which the variance of
feature vectors is not taken into account.
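
The maximization of equation (8) for a left-to-right model can be
sketched with a Viterbi-style recursion in the log domain. This is a
minimal sketch under stated assumptions, not the implementation of
the identification function calculation units: the function names,
the representation of a state as a pair of per-component mean and
variance lists, and the table `log_trans` of log transition
probabilities are all hypothetical. The initial state probability is
1 for the leftmost state, as described above.

```python
import math


def log_gauss(mu, var, xi, psi):
    """Logarithm of equation (10) for one component."""
    v = var + psi
    return -0.5 * (math.log(2.0 * math.pi * v) + (mu - xi) ** 2 / v)


def log_output(state, z):
    """Logarithm of equation (9): sum of the component log terms."""
    means, variances = state
    xi, psi = z
    return sum(log_gauss(m, v, x, p) for m, v, x, p in zip(means, variances, xi, psi))


def viterbi_score(states, log_trans, Z):
    """Return log G_k(Z) for a left-to-right HMM starting in state 0.

    states:    list of (means, variances), one entry per state.
    log_trans: log_trans[i][j] is the log transition probability from
               state i to state j; only j == i and j == i + 1 are read.
    Z:         list of (xi, psi) pairs, one per frame.
    """
    n_states = len(states)
    neg_inf = float("-inf")
    score = [neg_inf] * n_states
    score[0] = log_output(states[0], Z[0])
    for z in Z[1:]:
        new = [neg_inf] * n_states
        for j in range(n_states):
            best = score[j] + log_trans[j][j]            # stay in state j
            if j > 0:                                    # or arrive from the left
                best = max(best, score[j - 1] + log_trans[j - 1][j])
            if best > neg_inf:
                new[j] = best + log_output(states[j], z)
        score = new
    return max(score)
```

Working with logarithms turns the products of equation (8) into sums
and avoids numerical underflow for long frame sequences.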

For a learning method of the HMM, for example, the 
Baum-Welch re-estimation method is known. 

Referring again to Fig. 4, each identification function
calculation unit 21-k (k = 1, 2, ..., K) stores an
identification function $G_k(Z)$ given by equation (8) defined
by initial state probabilities $\pi_k(q_h)$, transition
probabilities $a_k(q_i, q_j)$, and output probabilities $b_k(q_i)(O)$,
which are determined in advance by means of learning for a
k-class model, and each identification function calculation
unit 21-k calculates the identification function $G_k(Z)$ by
employing, as the argument, the feature distribution
parameter Z supplied from the feature extractor 5 and
outputs the calculated function value $G_k(Z)$ (observation
probability) to a decision unit 22. The identification
function calculation unit 21-s stores an identification
function $G_s(Z)$ which is similar to the identification
function $G_k(Z)$ given by equation (8) and which is defined by
initial state probabilities $\pi_s(q_h)$, transition probabilities
$a_s(q_i, q_j)$, and output probabilities $b_s(q_i)(O)$, which are
supplied from the non-speech acoustic model correction unit
7. The identification function calculation unit 21-s
calculates the identification function $G_s(Z)$ by employing, as
the argument, the feature distribution parameter Z supplied
from the feature extractor 5 and outputs the resultant
function value $G_s(Z)$ (observation probability) to the
decision unit 22.

The decision unit 22 determines which class (acoustic
model) the feature distribution parameter Z, that is, the
input voice, belongs to by applying, for example, a decision
rule shown in equation (11) to the respective function
values $G_k(Z)$ (including $G_s(Z)$) output from the identification
function calculation unit 21-s and the identification
function calculation units 21-1 to 21-K.

$$
C(Z) = C_k, \quad \text{if } G_k(Z) = \max_i \{G_i(Z)\} \tag{11}
$$

where C(Z) denotes a function which indicates a class to
which the feature distribution parameter Z belongs. In
equation (11), max on the right-hand side of the second
equation denotes the maximum value of the function values
$G_i(Z)$ following max (where i = s, 1, 2, ..., K).

If the decision unit 22 determines the class in 
accordance with equation (11), the decision unit 22 outputs 
the resultant class as a recognition result of the input 
voice. 
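
The decision rule of equation (11) amounts to selecting the class
with the greatest function value. A minimal sketch follows, assuming
that the function values are collected in a dictionary keyed by class
label; the name `classify` and the labels used in the usage note are
hypothetical.

```python
def classify(scores):
    """Return the class label whose identification function value is greatest.

    scores: dict mapping a class label (a word, or the non-speech
            class) to the corresponding function value G(Z).
    """
    return max(scores, key=scores.get)
```

For example, `classify({"non-speech": 0.1, "word1": 0.7, "word2": 0.2})`
selects "word1".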

Referring again to Fig. 1, the non-speech acoustic
model correction unit 7 creates a new identification
function $G_s(Z)$ for adapting the non-speech acoustic model
stored in the speech recognition unit 6, on the basis of
ambient noise represented by voice data which is extracted
during the noise observation interval Tn, that is, the second
of the two noise observation intervals Tm and Tn,
and which is supplied from the noise observation interval
extractor 3. Using this new identification function $G_s(Z)$,
the non-speech acoustic model correction unit 7 adapts the
non-speech acoustic model stored in the speech recognition
unit 6.

More specifically, in the non-speech acoustic model
correction unit 7, as shown in Fig. 6, a feature vector y is
observed for each of N frames of the voice data (ambient
noise) during the noise observation interval Tn supplied
from the noise observation interval extractor 3, and a
feature distribution such as that shown in the following
equation is created in a similar manner as performed by the
feature extractor 5.

$$
\{F_1(y), F_2(y), \ldots, F_N(y)\} \tag{12}
$$

Herein, the feature distribution $\{F_i(y), i = 1, 2, \ldots, N\}$
is a probability density function and will also be
referred to as a non-speech feature distribution PDF.

The non-speech acoustic model correction unit 7 maps
the non-speech feature distribution PDF to a probability
distribution $F_s(y)$ corresponding to a non-speech acoustic
model in accordance with the following equation, as shown in
Fig. 7.

$$ F_s(y) = V(F_1(y), F_2(y), \ldots, F_N(y)) \tag{13} $$

where V is a correction function (mapping function) which
maps the non-speech feature distribution PDF
$\{F_i(y), i = 1, 2, \ldots, N\}$ to a non-speech acoustic model $F_s(y)$.

The non-speech acoustic model correction unit 7 updates
the non-speech acoustic model stored in the speech
recognition unit 6 using $F_s(y)$ so as to adapt the non-speech
acoustic model.

Herein, if it is assumed that the probability
distribution $F_s(y)$ representing the non-speech acoustic model
is given by a normal distribution with a mean value of $\mu_s$ and
a co-variance matrix of $\Sigma_s$, and if it is assumed that there
is no correlation among components of the feature vector y
of each frame, then the co-variance matrix $\Sigma_i$ of the
non-speech feature distribution PDF $\{F_i(y), i = 1, 2, \ldots, N\}$
becomes a diagonal matrix. However, it is required as a
prerequisite that the co-variance matrix of the non-speech
acoustic model be also a diagonal matrix. Hence, if there
is no correlation among components of the feature vector y
of each frame in the noise observation interval Tn, the
non-speech feature distribution PDF $\{F_i(y), i = 1, 2, \ldots, N\}$
becomes a normal distribution $N(\mu_i, \Sigma_i)$ having a mean value
and a variance corresponding to each component. Herein, $\mu_i$
denotes the mean value of $F_i(y)$ and $\Sigma_i$ denotes a co-variance
matrix of $F_i(y)$.

On the assumption described above, the non-speech
acoustic model correction unit 7 adapts the non-speech
acoustic model $F_s(y)$ using the non-speech feature
distribution PDF by means of a most (maximum) likelihood
method, a complex (mixed) statistic method, or a minimum
distance-maximum separation theorem (minimum distance
method).

When the non-speech acoustic model is adapted using the
most likelihood method, a normal distribution $N(\mu_s, \Sigma_s)$
containing the non-speech feature distributions PDF
$\{F_i(y), i = 1, 2, \ldots, N\}$ is determined as a non-speech acoustic model,
as shown in Fig. 8.

Herein, as shown in Fig. 9, M-dimensional feature
vectors y obtained from a t-th frame during the noise
observation interval Tn are denoted by $(y_1(t), y_2(t), \ldots, y_M(t))$.
Furthermore, a feature distribution
obtained from the feature vectors $(y_1(t), y_2(t), \ldots, y_M(t))$
is denoted by $Y_t$, and a normal distribution representing the
feature distribution is denoted by $N(\mu_t, \Sigma_t)$.

In the most likelihood method, a measure L indicating
the degree to which non-speech feature distributions
$Y_1, Y_2, \ldots, Y_N$ are observed is defined using the non-speech
acoustic model $F_s(y)$ represented by the normal distribution
$N(\mu_s, \Sigma_s)$, for example, as shown in the following equation.

$$ L \triangleq \log \Pr(Y_1, Y_2, \ldots, Y_t, \ldots, Y_N \mid N(\mu_s, \Sigma_s)) \tag{14} $$

where log denotes a natural logarithm, and
$\Pr(Y_1, Y_2, \ldots, Y_N \mid N(\mu_s, \Sigma_s))$ denotes the probability that a series of
non-speech feature distributions $Y_1, Y_2, \ldots, Y_N$ is observed from
the non-speech acoustic model $N(\mu_s, \Sigma_s)$ (= $F_s(y)$).

Herein, if it is assumed that the non-speech feature
distributions $Y_1, Y_2, \ldots, Y_N$ are independent of each other,
the measure L in equation (14) can be given by the following
equation.

$$
\begin{aligned}
L &= \log \prod_{t=1}^{N} \Pr(Y_t \mid N(\mu_s, \Sigma_s)) \\
  &= \sum_{t=1}^{N} \log \Pr(Y_t \mid N(\mu_s, \Sigma_s))
\end{aligned}
\tag{15}
$$

When the measure L given by equation (15) (or equation
(14)) has a large value, the possibility becomes high that
the non-speech feature distributions $Y_1, Y_2, \ldots, Y_N$ are
observed from the non-speech acoustic model. Therefore, the
non-speech acoustic model can be updated (adapted) properly
by employing a non-speech acoustic model $N(\mu_s, \Sigma_s)$ which
gives the greatest (maximum) measure L represented by equation
(15). Thus, it is necessary to determine the mean value $\mu_s$ and
the variance $\Sigma_s$ of the normal distribution $N(\mu_s, \Sigma_s)$
representing the non-speech acoustic model $F_s(y)$ so that the
normal distribution defined by $\mu_s$ and $\Sigma_s$ results in the
maximum measure L given by equation (15). If the measure L
given by equation (14) is partially differentiated with
respect to the mean value $\mu_s$ and the variance $\Sigma_s$,
respectively, the partial derivatives of the measure L
become 0 at the values of the mean value $\mu_s$ and variance $\Sigma_s$
at which the measure L becomes maximum. Therefore, the
values of the mean value $\mu_s$ and variance $\Sigma_s$ can be determined
by solving equation (16).

$$
\begin{cases}
\dfrac{\partial L}{\partial \mu_s} = 0 \\[2ex]
\dfrac{\partial L}{\partial \Sigma_s} = 0
\end{cases}
\tag{16}
$$

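Herein, if the values of the mean value $\mu_s$ and the
variance $\Sigma_s$ which satisfy equation (16) are represented
by equation (17), then the correction function (mapping
function) V given by equation (13) is defined by the
functions $V_1$ and $V_2$ in equation (17).

$$
\begin{cases}
\mu_s = V_1(F_1(y), F_2(y), \ldots, F_N(y)) \\
\Sigma_s = V_2(F_1(y), F_2(y), \ldots, F_N(y))
\end{cases}
\tag{17}
$$

To solve equation (16), the mean value $\mu_t$ and the
variance (variance matrix) $\Sigma_t$ which define the non-speech
feature distribution $N(\mu_t, \Sigma_t)$ (= $Y_t$) are represented by the
following equation (18).

$$
\mu_t =
\begin{bmatrix}
\mu_1(t) \\ \mu_2(t) \\ \vdots \\ \mu_M(t)
\end{bmatrix},
\qquad
\Sigma_t =
\begin{bmatrix}
\sigma_{1,1}^2(t) & \sigma_{1,2}^2(t) & \cdots & \sigma_{1,M}^2(t) \\
\sigma_{2,1}^2(t) & \sigma_{2,2}^2(t) & \cdots & \sigma_{2,M}^2(t) \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{M,1}^2(t) & \sigma_{M,2}^2(t) & \cdots & \sigma_{M,M}^2(t)
\end{bmatrix}
\tag{18}
$$

where t is an integer which can take a value of 1 to N, and
$\sigma_{i,j}^2(t)$ denotes the co-variance between the i-th and j-th
dimensions.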

As described earlier, because the components of the
feature vectors of the respective frames are assumed to have
no correlations with each other, the co-variance matrix $\Sigma_t$ of
the non-speech feature distribution $N(\mu_t, \Sigma_t)$ becomes a
diagonal matrix, and thus, of the components of $\Sigma_t$ in
equation (18), those components (co-variance) with i and j
which are different from each other become 0. Thus, the
co-variance matrix $\Sigma_t$ can be represented by the following
equation.

$$
\Sigma_t =
\begin{bmatrix}
\sigma_{1,1}^2(t) & & & 0 \\
 & \sigma_{2,2}^2(t) & & \\
 & & \ddots & \\
0 & & & \sigma_{M,M}^2(t)
\end{bmatrix}
\tag{19}
$$

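Similarly, the mean value $\mu_s$ and the variance (variance
matrix) $\Sigma_s$ of the non-speech feature distribution $N(\mu_s, \Sigma_s)$
are represented by the following equation (20).

$$
\mu_s =
\begin{bmatrix}
\mu_1(s) \\ \mu_2(s) \\ \vdots \\ \mu_M(s)
\end{bmatrix},
\qquad
\Sigma_s =
\begin{bmatrix}
\sigma_{1,1}^2(s) & \sigma_{1,2}^2(s) & \cdots & \sigma_{1,M}^2(s) \\
\sigma_{2,1}^2(s) & \sigma_{2,2}^2(s) & \cdots & \sigma_{2,M}^2(s) \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{M,1}^2(s) & \sigma_{M,2}^2(s) & \cdots & \sigma_{M,M}^2(s)
\end{bmatrix}
\tag{20}
$$

Also in this case, the co-variance matrix $\Sigma_s$ of the
non-speech feature distribution $N(\mu_s, \Sigma_s)$ is assumed to be a
diagonal matrix as described earlier. Therefore, of the
components of $\Sigma_s$ in equation (20), those components
(co-variance) with i and j which are different from each other
become 0. Thus, as in the case of equation (19), the
co-variance matrix $\Sigma_s$ can be represented by the following
equation.

$$
\Sigma_s =
\begin{bmatrix}
\sigma_{1,1}^2(s) & & & 0 \\
 & \sigma_{2,2}^2(s) & & \\
 & & \ddots & \\
0 & & & \sigma_{M,M}^2(s)
\end{bmatrix}
\tag{21}
$$

Herein, for simplicity, some suffixes of the components
of the co-variance $\Sigma_t$ in equation (19) are removed, and the
mean value $\mu_t$ and the variance matrix $\Sigma_t$ which define the
non-speech feature distribution $N(\mu_t, \Sigma_t)$ are represented by
the following equation.

$$
\mu_t =
\begin{bmatrix}
\mu_1(t) \\ \mu_2(t) \\ \vdots \\ \mu_M(t)
\end{bmatrix},
\qquad
\Sigma_t =
\begin{bmatrix}
\sigma_1^2(t) & & & 0 \\
 & \sigma_2^2(t) & & \\
 & & \ddots & \\
0 & & & \sigma_M^2(t)
\end{bmatrix}
\tag{22}
$$

where t = 1, 2, ..., N.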

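Similarly, some suffixes of the components of the
co-variance $\Sigma_s$ in equation (21) are removed, and the mean value
$\mu_s$ and the variance matrix $\Sigma_s$ which define the non-speech
feature distribution $N(\mu_s, \Sigma_s)$ are represented by the
following equation.

$$
\mu_s =
\begin{bmatrix}
\mu_1(s) \\ \mu_2(s) \\ \vdots \\ \mu_M(s)
\end{bmatrix},
\qquad
\Sigma_s =
\begin{bmatrix}
\sigma_1^2(s) & & & 0 \\
 & \sigma_2^2(s) & & \\
 & & \ddots & \\
0 & & & \sigma_M^2(s)
\end{bmatrix}
\tag{23}
$$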



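Herein, if the non-speech feature distribution $Y_t$ (=
$N(\mu_t, \Sigma_t)$) in equation (15) is regarded as a probability
density function defined by the mean value $\mu_t$ and the
variance matrix $\Sigma_t$, and the non-speech feature distribution
$N(\mu_s, \Sigma_s)$ is regarded as a probability density function
defined by the mean value $\mu_s$ and the variance matrix $\Sigma_s$, then
the measure L in equation (15) can be calculated as follows.

$$
\begin{aligned}
L &= \sum_{t=1}^{N} \log \left[ \frac{1}{(2\pi)^{M/2} \, |\Sigma_t + \Sigma_s|^{1/2}}
     \exp\left( -\frac{1}{2} (\mu_t - \mu_s)^T (\Sigma_t + \Sigma_s)^{-1} (\mu_t - \mu_s) \right) \right] \\
  &= -\frac{1}{2} MN \log 2\pi
     - \frac{1}{2} \sum_{t=1}^{N} \log |\Sigma_t + \Sigma_s|
     - \frac{1}{2} \sum_{t=1}^{N} (\mu_t - \mu_s)^T (\Sigma_t + \Sigma_s)^{-1} (\mu_t - \mu_s) \\
  &= -\frac{1}{2} MN \log 2\pi
     - \frac{1}{2} \sum_{t=1}^{N} \sum_{k=1}^{M} \log\left( \sigma_k^2(t) + \sigma_k^2(s) \right)
     - \frac{1}{2} \sum_{t=1}^{N} \sum_{k=1}^{M} \frac{(\mu_k(t) - \mu_k(s))^2}{\sigma_k^2(t) + \sigma_k^2(s)}
\end{aligned}
\tag{24}
$$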



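Herein, $(\Sigma_t + \Sigma_s)^{-1}$ in equation (24) can be given by the
following equation.

$$
(\Sigma_t + \Sigma_s)^{-1} =
\begin{bmatrix}
\dfrac{1}{\sigma_1^2(t) + \sigma_1^2(s)} & & 0 \\
 & \ddots & \\
0 & & \dfrac{1}{\sigma_M^2(t) + \sigma_M^2(s)}
\end{bmatrix}
\tag{25}
$$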



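If the measure L represented by equation (24) is
partially differentiated with respect to the mean value $\mu_s$
and the variance matrix $\Sigma_s$ as shown in equation (16), then the
following equation is obtained.

$$
\begin{cases}
\dfrac{\partial L}{\partial \mu_k(s)} = \displaystyle\sum_{t=1}^{N} \frac{\mu_k(t) - \mu_k(s)}{\sigma_k^2(t) + \sigma_k^2(s)} \\[3ex]
\dfrac{\partial L}{\partial \sigma_k^2(s)} = -\dfrac{1}{2} \displaystyle\sum_{t=1}^{N} \frac{1}{\sigma_k^2(t) + \sigma_k^2(s)}
  + \dfrac{1}{2} \displaystyle\sum_{t=1}^{N} \frac{(\mu_k(t) - \mu_k(s))^2}{(\sigma_k^2(t) + \sigma_k^2(s))^2}
\end{cases}
\tag{26}
$$

where k = 1, 2, ..., M.

From equation (26), the mean value $\mu_s$ ($\mu_1(s), \mu_2(s), \ldots, \mu_M(s)$)
and the variance $\Sigma_s$ ($\sigma_1^2(s), \sigma_2^2(s), \ldots, \sigma_M^2(s)$) can be
determined by solving the following equation.

$$
\begin{cases}
\displaystyle\sum_{t=1}^{N} \frac{\mu_k(t) - \mu_k(s)}{\sigma_k^2(t) + \sigma_k^2(s)} = 0 \\[3ex]
\displaystyle\sum_{t=1}^{N} \left[ \frac{1}{\sigma_k^2(t) + \sigma_k^2(s)}
  - \frac{(\mu_k(t) - \mu_k(s))^2}{(\sigma_k^2(t) + \sigma_k^2(s))^2} \right] = 0
\end{cases}
\tag{27}
$$

Hereinafter, suffixes of $\mu_k(t)$, $\mu_k(s)$, $\sigma_k^2(t)$, $\sigma_k^2(s)$ in
equation (27) are represented in simplified fashion as
shown below in equation (28).

$$
\mu_t = \mu_k(t), \quad
\mu_s = \mu_k(s), \quad
v_t = \sigma_k^2(t), \quad
v_s = \sigma_k^2(s)
\tag{28}
$$

Thus, equation (27) can be written as follows.

$$
\begin{cases}
\displaystyle\sum_{t=1}^{N} \frac{\mu_t - \mu_s}{v_t + v_s} = 0 \\[3ex]
\displaystyle\sum_{t=1}^{N} \left[ \frac{1}{v_t + v_s} - \frac{(\mu_t - \mu_s)^2}{(v_t + v_s)^2} \right] = 0
\end{cases}
\tag{29}
$$
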

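Equation (29) can be rewritten as follows.

$$
\begin{cases}
\mu_s = \dfrac{\displaystyle\sum_{t=1}^{N} \frac{\mu_t}{v_t + v_s}}{\displaystyle\sum_{t=1}^{N} \frac{1}{v_t + v_s}} \\[4ex]
\displaystyle\sum_{t=1}^{N} \left[ \frac{1}{v_t + v_s} - \frac{(\mu_t - \mu_s)^2}{(v_t + v_s)^2} \right] = 0
\end{cases}
\tag{30}
$$

In equation (30), in order to obtain $\mu_s$, it is necessary to
determine $v_s$. $v_s$ may be determined, for example, using the
Newton descent method or the Monte Carlo method.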

In the case where the non-speech acoustic model is 
adapted by means of the most likelihood method described 
above, the non-speech acoustic model correction unit 7 
performs a process (model adaptation process) according to, 
for example, a flow chart shown in Fig. 10. 

In step S1, a non-speech feature distribution $F_t(y)$ (=
$N(\mu_t, \Sigma_t)$) is determined from voice data (noise) during a
noise observation interval Tn. Then in step S2, the
variance $v_s$ in equation (30) is determined by means of the
Newton descent method or the Monte Carlo method to obtain
the value of the variance $v_s$ which maximizes the measure L
represented by equation (15) indicating the degree to which
the series of non-speech feature distributions is observed.
Furthermore, in step S3, the mean value $\mu_s$ is determined
using the variance $v_s$ determined in step S2 in accordance
with equation (30). Thereafter, the process proceeds to
step S4. In step S4, an identification function $G_s(Z)$
corresponding to a normal distribution $N(\mu_s, v_s)$ defined by
the mean value $\mu_s$ determined in step S3 and the variance $v_s$
determined in step S2 is created. The identification
function of the identification function calculation unit 21-s
in the speech recognition unit 6 (Fig. 4) is updated by
the created identification function $G_s(Z)$, and the process is
ended.
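
Steps S2 and S3 can be illustrated for a single component of the feature vector as follows. This is only a sketch under stated assumptions: it solves the second line of equation (30) by bisection rather than by the Newton descent method or the Monte Carlo method named in the text, it assumes every frame variance $v_t$ is positive, and it falls back to $v_s = 0$ when the condition is already non-negative at zero. The function names are hypothetical.

```python
def weighted_mean(mus, variances, v_s):
    """First line of equation (30): mean weighted by 1 / (v_t + v_s)."""
    weights = [1.0 / (v_t + v_s) for v_t in variances]
    return sum(w * m for w, m in zip(weights, mus)) / sum(weights)


def variance_condition(mus, variances, v_s):
    """Left-hand side of the second line of equation (30)."""
    mu_s = weighted_mean(mus, variances, v_s)
    return sum(1.0 / (v_t + v_s) - (m - mu_s) ** 2 / (v_t + v_s) ** 2
               for m, v_t in zip(mus, variances))


def adapt_most_likelihood(mus, variances, tol=1e-10, max_iter=200):
    """Return (mu_s, v_s) for one component (bisection is an assumption)."""
    if not mus or len(mus) != len(variances):
        raise ValueError("mus and variances must be non-empty and equally long")
    if any(v <= 0.0 for v in variances):
        raise ValueError("frame variances must be positive")
    lo = 0.0
    if variance_condition(mus, variances, lo) >= 0.0:
        # No sign change at zero: this sketch keeps v_s = 0.
        return weighted_mean(mus, variances, lo), lo
    hi = 1.0
    for _ in range(200):
        # Grow the bracket until the condition becomes positive.
        if variance_condition(mus, variances, hi) > 0.0:
            break
        hi *= 2.0
    else:
        raise RuntimeError("could not bracket a root of the variance condition")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if variance_condition(mus, variances, mid) < 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    v_s = 0.5 * (lo + hi)
    return weighted_mean(mus, variances, v_s), v_s
```

For two frames with means 0 and 4 and unit variances, the condition reduces to 2(v_s - 3)/(1 + v_s)^2 = 0, so the sketch returns a mean of 2 and a variance of 3.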

In the case where the non-speech acoustic model is
adapted by means of the complex statistic method, a plurality of
statistics, that is, a set of non-speech feature
distributions $\{F_i(y), i = 1, 2, \ldots, N\}$ are combined as
shown in Fig. 11, and the resultant complex statistic, that
is, the normal distribution $N(\mu_s, \Sigma_s)$ obtained as a result is
used to update the non-speech acoustic model $F_s(y)$.

When the complex statistic is used, the measure L
indicating the degree to which non-speech feature
distributions $F_1(y), F_2(y), \ldots, F_N(y)$ are observed in the
noise observation interval Tn is defined using the non-speech
model $F_s(y)$ represented by the normal distribution $N(\mu_s, \Sigma_s)$
as shown in the following equation.



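$$
\begin{aligned}
L &= \log \prod_{i=1}^{N} E(F_s(F_i(y))) \\
  &= \sum_{i=1}^{N} \log \int_{\Omega_i} F_s(y) \cdot F_i(y) \, dy \\
  &= \sum_{i=1}^{N} \left[ -\frac{(\mu_s - \mu_i)^2}{2 (v_s + v_i)}
     - \frac{1}{2} \log\left( (2\pi)^M (v_s + v_i) \right) \right] \\
  &= -\frac{1}{2} MN \log 2\pi
     - \frac{1}{2} \sum_{i=1}^{N} \log(v_s + v_i)
     - \frac{1}{2} \sum_{i=1}^{N} \frac{(\mu_s - \mu_i)^2}{v_s + v_i}
\end{aligned}
\tag{31}
$$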



In equation (31), $F_s(F_i(y))$ is a complex statistic, and
E() represents an expected value of the variable enclosed in
parentheses. The integration represented in the second row
of equation (31) is performed over the entire feature vector
space $\Omega_i$ (power spectrum space in this specific embodiment)
of M-dimensional feature vectors y used to obtain the
non-speech distribution $F_i(y)$. Furthermore, the modification
from the second row to the third row in equation (31) can be
accomplished by regarding the non-speech feature
distribution $F_i(y)$ (= $N(\mu_i, \Sigma_i)$) as a probability density
function defined by the mean value $\mu_i$ and the variance matrix
$\Sigma_i$ and regarding the non-speech feature distribution $F_s(y)$ (=
$N(\mu_s, \Sigma_s)$) as a probability density function defined by the
mean value $\mu_s$ and the variance matrix $\Sigma_s$.
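
The step from the second row to the third row of equation (31) relies on the fact that the integral of the product of two normal densities is itself a normal density evaluated at one mean, with the variances added. The following sketch checks this numerically for the one-dimensional case; the function names, the integration span, and the step count are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expected_model_value(mu_s, v_s, mu_i, v_i, steps=20000, span=10.0):
    """Integrate F_s(y) * F_i(y) over y with the trapezoidal rule."""
    sd = math.sqrt(max(v_s, v_i))
    lo = min(mu_s, mu_i) - span * sd
    hi = max(mu_s, mu_i) + span * sd
    h = (hi - lo) / steps
    total = 0.0
    for n in range(steps + 1):
        y = lo + n * h
        weight = 0.5 if n in (0, steps) else 1.0
        total += weight * normal_pdf(y, mu_s, v_s) * normal_pdf(y, mu_i, v_i)
    return total * h


def closed_form(mu_s, v_s, mu_i, v_i):
    """Closed form of the same integral: N(mu_s; mu_i, v_s + v_i)."""
    return normal_pdf(mu_s, mu_i, v_s + v_i)
```

Taking the logarithm of the closed form gives one summand of the third row of equation (31) for M = 1.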

The updating (adapting) of the non-speech acoustic
model can be performed employing a non-speech model $N(\mu_s, \Sigma_s)$
which results in the greatest (maximum) value for the
measure L represented by equation (31). If the measure L
given by equation (31) is partially differentiated with
respect to the mean value $\mu_s$ and the variance $v_s$,
respectively, the partial derivatives of the measure L
become 0 at the values of the mean value $\mu_s$ and variance $v_s$
at which the measure L becomes maximum. Thus, the mean
value $\mu_s$ and the variance $v_s$ (= $\sigma_s^2$) can be determined by
solving equation (32).

$$
\begin{cases}
\dfrac{\partial L}{\partial \mu_s} = 0 \\[2ex]
\dfrac{\partial L}{\partial v_s} = 0
\end{cases}
\tag{32}
$$

Substituting the measure L given by equation (31) into
equation (32) yields equation (33).

$$
\begin{cases}
\displaystyle\sum_{i=1}^{N} \frac{\mu_i - \mu_s}{v_s + v_i} = 0 \\[3ex]
\displaystyle\sum_{i=1}^{N} \left[ \frac{1}{v_s + v_i} - \frac{(\mu_i - \mu_s)^2}{(v_s + v_i)^2} \right] = 0
\end{cases}
\tag{33}
$$



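Equation (33) can be rewritten as follows.

$$
\begin{cases}
\mu_s = \dfrac{\displaystyle\sum_{i=1}^{N} \frac{\mu_i}{v_s + v_i}}{\displaystyle\sum_{i=1}^{N} \frac{1}{v_s + v_i}} \\[4ex]
\displaystyle\sum_{i=1}^{N} \left[ \frac{1}{v_s + v_i} - \frac{(\mu_i - \mu_s)^2}{(v_s + v_i)^2} \right] = 0
\end{cases}
\tag{34}
$$

In equation (34), in order to determine $\mu_s$, it is necessary
to determine $v_s$. $v_s$ may be determined, for example, using
the Newton descent method or the Monte Carlo method, as in
the case where the most likelihood method is employed.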

In the case where the non-speech acoustic model is 
adapted by means of the complex statistic method described 
above, the non-speech acoustic model correction unit 7 
performs a process (model adaptation process) according to, 
for example, a flow chart shown in Fig. 12. 

In step S11, a non-speech feature distribution $F_t(y)$ (=
$N(\mu_t, \Sigma_t)$) is determined from voice data (noise) during a
noise observation interval Tn. Then in step S12, the
variance $v_s$ in equation (34) is determined by means of the
Newton descent method or the Monte Carlo method to obtain
the value of the variance $v_s$ which maximizes the measure L
represented by equation (31) indicating the degree to which
the series of non-speech feature distributions is observed.
Furthermore, in step S13, the mean value $\mu_s$ is determined
using the variance $v_s$ determined in step S12 in accordance
with equation (34). Thereafter, the process proceeds to
step S14. In step S14, an identification function $G_s(Z)$
corresponding to a normal distribution $N(\mu_s, v_s)$ defined by
the mean value $\mu_s$ determined in step S13 and the variance $v_s$
determined in step S12 is created. The identification
function of the identification function calculation unit 21-s
in the speech recognition unit 6 (Fig. 4) is updated by
the created identification function $G_s(Z)$, and the process is
ended.

In the case where the non-speech acoustic model is
adapted by means of the minimum distance-maximum separation
theorem, the non-speech acoustic model $F_s(y)$ is updated by a
normal distribution $N(\mu_s, \Sigma_s)$ which minimizes the sum of
distances $d_1, d_2, \ldots, d_N$ from the respective non-speech
feature distributions in the form of normal distributions
$F_1(y)$ (= $N(\mu_1, \Sigma_1)$), $F_2(y)$ (= $N(\mu_2, \Sigma_2)$), ..., $F_N(y)$
(= $N(\mu_N, \Sigma_N)$).

The distance $d_{ij}$ between a certain normal distribution
$N(\mu_i, \Sigma_i)$ and another normal distribution $N(\mu_j, \Sigma_j)$ may be
represented using, for example, a Bhattacharyya distance or
a Mahalanobis distance.

When a Bhattacharyya distance is employed, the distance $d_{ij}$
between a normal distribution $N(\mu_i, \Sigma_i)$ and a normal
distribution $N(\mu_j, \Sigma_j)$ is given by the following equation.

$$
d_{ij} = \frac{1}{8} (\mu_i - \mu_j)^T \left( \frac{\Sigma_i + \Sigma_j}{2} \right)^{-1} (\mu_i - \mu_j)
  + \frac{1}{2} \log \frac{\left| (\Sigma_i + \Sigma_j)/2 \right|}{|\Sigma_i|^{1/2} \, |\Sigma_j|^{1/2}}
\tag{35}
$$

When a Mahalanobis distance is employed, the distance $d_{ij}$
between a normal distribution $N(\mu_i, \Sigma_i)$ and a normal
distribution $N(\mu_j, \Sigma_j)$ is given by the following equation.

$$
d_{ij} = (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j), \qquad (\Sigma_i = \Sigma_j = \Sigma)
\tag{36}
$$

The Mahalanobis distance $d_{ij}$ given by equation (36) is
determined on the assumption that the two distributions the
distance between which is to be determined have the same
variance, that is, on the assumption that the co-variance
matrix $\Sigma_i$ of the normal distribution $N(\mu_i, \Sigma_i)$ is identical
to the co-variance matrix $\Sigma_j$ of the normal distribution
$N(\mu_j, \Sigma_j)$ ($\Sigma_i = \Sigma_j = \Sigma$). Therefore, when the Mahalanobis distance is
employed, a restriction is imposed upon the $N(\mu_i, \Sigma_i)$
representing the non-speech feature distribution $F_i(y)$.

In the present embodiment, for the above reason, the
Bhattacharyya distance given by equation (35) is employed.
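
Under the diagonal co-variance assumption made earlier, the Bhattacharyya distance of equation (35) splits into a sum over the components of the feature vector. The following sketch computes it that way; the function name and the list-based argument layout are illustrative assumptions.

```python
import math


def bhattacharyya_diag(mu_i, var_i, mu_j, var_j):
    """Bhattacharyya distance between two normal distributions whose
    co-variance matrices are diagonal (variances given per component)."""
    if not (len(mu_i) == len(var_i) == len(mu_j) == len(var_j)):
        raise ValueError("all arguments must have the same length")
    distance = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        avg = 0.5 * (vi + vj)
        # Mean term: (1/8) * (mu_i - mu_j)^2 / ((v_i + v_j) / 2)
        distance += 0.125 * (mi - mj) ** 2 / avg
        # Variance term: (1/2) * log(((v_i + v_j) / 2) / sqrt(v_i * v_j))
        distance += 0.5 * math.log(avg / math.sqrt(vi * vj))
    return distance
```

Unlike the Mahalanobis distance of equation (36), this measure does not require the two distributions to share a variance, which is the reason given in the text for choosing it.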

In the case where the minimum distance-maximum
separation theorem is employed, the measure L indicating the
degree to which non-speech feature distributions $Y_1(X)$,
$Y_2(X)$, ..., $Y_N(X)$ are observed during the noise observation
interval Tn is defined using the non-speech acoustic model
$F_s(y)$ represented by the normal distribution $N(\mu_s, \Sigma_s)$, for
example, as shown in the following equation, which is written
for each component of the feature vector with the component
suffix omitted.

$$
L \triangleq \sum_{i=1}^{N} d_i
  = \sum_{i=1}^{N} \left[ \frac{1}{4} \frac{(\mu_i - \mu_s)^2}{\sigma_i^2 + \sigma_s^2}
  + \frac{1}{2} \log \frac{\sigma_i^2 + \sigma_s^2}{2 \sigma_i \sigma_s} \right]
\tag{37}
$$

When the measure L given by equation (37) becomes
minimum, the distance between the normal distributions of
the non-speech feature distribution and the non-speech
acoustic model becomes minimum. Therefore, the non-speech
acoustic model should be updated (adapted) by employing a
non-speech acoustic model $N(\mu_s, \Sigma_s)$ which results in the
smallest (minimum) value for the measure L represented by
equation (37). If the measure L given by equation (37) is
partially differentiated with respect to the mean value $\mu_s$
and the variance $\sigma_s^2$, respectively, the partial derivatives
of the measure L become 0 at the values of the mean value $\mu_s$
and variance $\sigma_s^2$ at which the measure L becomes minimum.
Therefore, the values of the mean value $\mu_s$ and variance $\sigma_s^2$
can be determined by solving equation (38).

$$
\begin{cases}
\dfrac{\partial L}{\partial \mu_s} = 0 \\[2ex]
\dfrac{\partial L}{\partial \sigma_s^2} = 0
\end{cases}
\tag{38}
$$



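Substituting the measure L given by equation (37) into
equation (38) yields equation (39).

$$
\begin{cases}
-\dfrac{1}{2} \displaystyle\sum_{i=1}^{N} \frac{\mu_i - \mu_s}{\sigma_i^2 + \sigma_s^2} = 0 \\[3ex]
\displaystyle\sum_{i=1}^{N} \left[ \frac{1}{2 (\sigma_i^2 + \sigma_s^2)} - \frac{1}{4 \sigma_s^2}
  - \frac{(\mu_i - \mu_s)^2}{4 (\sigma_i^2 + \sigma_s^2)^2} \right] = 0
\end{cases}
\tag{39}
$$

Thus, from equation (39), the following equation (40) can be
obtained.

$$
\begin{cases}
\mu_s = \dfrac{\displaystyle\sum_{i=1}^{N} \frac{\mu_i}{\sigma_i^2 + \sigma_s^2}}{\displaystyle\sum_{i=1}^{N} \frac{1}{\sigma_i^2 + \sigma_s^2}} \\[4ex]
\displaystyle\sum_{i=1}^{N} \left[ \frac{\sigma_s^2 - \sigma_i^2}{\sigma_s^2 (\sigma_i^2 + \sigma_s^2)}
  - \frac{(\mu_i - \mu_s)^2}{(\sigma_i^2 + \sigma_s^2)^2} \right] = 0
\end{cases}
\tag{40}
$$

In equation (40), in order to determine the mean value
$\mu_s$, it is necessary to determine the variance $\sigma_s^2$. $\sigma_s^2$ may be
determined, for example, using the Newton descent method or
the Monte Carlo method, as in the case where the most
likelihood method is employed.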

In the case where the non-speech acoustic model is 
adapted by means of the minimum distance-maximum separation 
theorem, the non-speech acoustic model correction unit 7 
performs a process (model adaptation process) according to, 
for example, a flow chart shown in Fig. 14. 

In step S21, a non-speech feature distribution $F_t(y)$ (=
$N(\mu_t, \Sigma_t)$) is determined from voice data (noise) during a
noise observation interval Tn. Then in step S22, the
variance $\sigma_s^2$ in equation (40) is determined by means of the
Newton descent method or the Monte Carlo method to obtain
the value thereof which minimizes the measure L represented
by equation (37), that is, the sum of the distances between
the series of non-speech feature distributions and the
non-speech acoustic model.
Furthermore, in step S23, the mean value $\mu_s$ is determined
using the variance $\sigma_s^2$ determined in step S22 in accordance
with equation (40). Thereafter, the process proceeds to
step S24. In step S24, an identification function $G_s(Z)$
corresponding to a normal distribution defined by the mean
value $\mu_s$ determined in step S23 and the variance $\sigma_s^2$
determined in step S22 is created. The identification
function of the identification function calculation unit 21-s
in the speech recognition unit 6 (Fig. 4) is updated by
the created identification function $G_s(Z)$, and the process is
ended.
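
Steps S22 and S23 can be sketched for a single component as a one-dimensional search over the model variance, with the mean obtained from the weighted average for each candidate variance. This is only an illustration under stated assumptions: the per-component distance is the diagonal form of the Bhattacharyya distance, a golden-section search over the logarithm of the variance stands in for the Newton descent method or the Monte Carlo method named in the text, the summed distance is assumed to have a single minimum inside the search range, and all names are hypothetical.

```python
import math


def weighted_mean(mus, variances, var_s):
    """Mean weighted by 1 / (sigma_i^2 + sigma_s^2)."""
    weights = [1.0 / (v + var_s) for v in variances]
    return sum(w * m for w, m in zip(weights, mus)) / sum(weights)


def total_distance(mus, variances, mu_s, var_s):
    """Sum of per-component Bhattacharyya distances to N(mu_s, var_s)."""
    total = 0.0
    for m, v in zip(mus, variances):
        total += 0.25 * (m - mu_s) ** 2 / (v + var_s)
        total += 0.5 * math.log((v + var_s) / (2.0 * math.sqrt(v * var_s)))
    return total


def adapt_minimum_distance(mus, variances, iterations=200):
    """Return (mu_s, var_s) minimizing the summed distance (sketch only)."""
    if not mus or len(mus) != len(variances):
        raise ValueError("mus and variances must be non-empty and equally long")
    if any(v <= 0.0 for v in variances):
        raise ValueError("frame variances must be positive")

    def objective(log_var):
        var_s = math.exp(log_var)
        return total_distance(mus, variances,
                              weighted_mean(mus, variances, var_s), var_s)

    spread = max(mus) - min(mus)
    a = math.log(min(variances)) - 5.0                 # assumed lower bound
    b = math.log(max(variances) + spread ** 2) + 5.0   # assumed upper bound
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iterations):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if objective(c) < objective(d):
            b = d   # the minimum lies in [a, d]
        else:
            a = c   # the minimum lies in [c, b]
    var_s = math.exp(0.5 * (a + b))
    return weighted_mean(mus, variances, var_s), var_s
```

When all frames share the same distribution, the summed distance is zero exactly at that distribution, so the search returns it.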

The operation of the speech recognition apparatus shown 
in Fig. 1 is described below. 

Voice data (voice to be recognized, including ambient
noise) is detected by a microphone 1 and input to a
conversion-into-frame unit 2. The conversion-into-frame
unit 2 converts the voice data into the form of frames.
Frames of voice data are sequentially supplied, as an
observation vector a, to the noise observation interval
extractor 3 and the feature extractor 5. The noise
observation interval extractor 3 extracts voice data
(ambient noise) during noise observation intervals Tm and Tn
immediately before a time $t_b$ at which the speech switch 4 is
turned on. The extracted voice data is supplied to the
feature extractor 5 and the non-speech acoustic model
correction unit 7.

The non-speech acoustic model correction unit 7 updates 
(adapts) a non-speech acoustic model on the basis of the 
voice data representing the ambient noise during the noise 
observation intervals Tm and Tn by means of one of the most 
likelihood method, the complex statistic method, and the 
minimum distance-maximum separation theorem, described above. 
The resultant updated non-speech acoustic model is supplied 
to the speech recognition unit 6. The speech recognition
unit 6 replaces an identification function corresponding to
a non-speech acoustic model which has been maintained until
that time with an identification function corresponding to
the non-speech acoustic model supplied from the non-speech
acoustic model correction unit 7, thereby adapting the
non-speech acoustic model.

On the other hand, the feature extractor 5 performs
acoustic analysis upon the voice data in the form of the
observation vector a supplied from the conversion-into-frame
unit 2 to determine the feature vector y thereof. The
feature extractor 5 then calculates the feature distribution
parameter Z representing the distribution in the feature
vector space on the basis of the obtained feature vector y
and the voice data (ambient noise) extracted during the
noise observation interval Tm. The calculated feature
distribution parameter Z is supplied to the speech
recognition unit 6. The speech recognition unit 6
calculates the values of the identification functions of the
acoustic models corresponding to a non-speech state and the
predetermined number (K) of words, respectively, using the
feature distribution parameter supplied from the feature
extractor 5. The acoustic model corresponding to the
function having the maximum value is output as the result of
the speech recognition.

As described above, because the voice data given in the
form of the observation vector a is converted into the
feature distribution parameter Z representing the
distribution in the feature vector space, that is, the space
of feature values of the voice data, the feature
distribution parameter is determined taking into account the
distribution characteristic of noise included in the voice
data. Furthermore, because the identification function
corresponding to the non-speech acoustic model for
discriminating (detecting) a non-speech sound is updated on
the basis of the voice data extracted during the noise
observation interval Tn immediately before the start of the
speech, a greater improvement in the speech recognition rate
is achieved.

In the case where the non-speech acoustic model is not
adapted, the speech recognition rate decreases greatly with
increasing non-speech interval Ts from a time at which the
speech switch 4 is turned on to a time at which speech is
started (Fig. 2). In contrast, in the case where the
non-speech acoustic model is adapted, the reduction in the
speech recognition rate can be suppressed to a very low
level even when the non-speech interval Ts becomes long,
thereby making it possible to achieve high recognition
performance substantially regardless of the length of the
non-speech interval Ts.

In the adaptation of the non-speech acoustic model
using the non-speech feature distribution $F_i(y)$ (= $N(\mu_i, \sigma_i^2)$)
by means of the most likelihood method, the complex
statistic method, or the minimum distance-maximum separation
theorem, a time series of non-speech feature distributions
$F_1(y), F_2(y), \ldots, F_N(y)$ obtained from the respective N
frames during the noise observation interval Tn (Fig. 2) are
treated in the same manner.

However, strictly speaking, the ambient noise in the
speech recognition interval is not identical to the ambient
noise in the noise observation interval Tn immediately
before the speech recognition interval. Besides, in general,
the deviation of the ambient noise at a particular point of
time in the noise observation interval Tn from the ambient
noise in the speech recognition interval increases with the
separation between that particular point of time and the
speech recognition interval (start time $t_c$ of the speech
recognition interval).

In view of the above, it is more desirable not to
deal equally with the time series of non-speech feature
distributions $F_1(y), F_2(y), \ldots, F_N(y)$ obtained from the
respective N frames in the noise observation interval Tn
(Fig. 2) but to deal with them such that a non-speech
feature distribution nearer the speech recognition interval
is weighted more heavily (a non-speech feature distribution
farther away from the speech recognition interval is
weighted more lightly), thereby making it possible to adapt
(correct or update) the non-speech acoustic model so as to
further improve the speech recognition accuracy.

For the above purpose, a freshness degree is introduced
to represent the freshness (the proximity to the speech
recognition interval) of the non-speech feature
distributions $F_1(y), F_2(y), \ldots, F_N(y)$ obtained in the noise
observation interval Tn, and the non-speech acoustic model
is adapted taking into account the freshness degree as
described below.

Fig. 15 illustrates an example of a manner in which the
non-speech acoustic model correction unit 7 shown in Fig. 1
is constructed so as to adapt the non-speech acoustic model
taking into account the freshness degree.

A freshness function storage unit 31 stores a freshness 
function representing the degree of freshness (or a 
parameter which defines the freshness function). 

Voice data in the form of a series of observation
vectors (N frames of voice data) extracted by the noise
observation interval extractor 3 during the noise
observation interval Tn is input to a correction unit 32.
The correction unit 32 extracts non-speech feature
distributions $F_1(y), F_2(y), \ldots, F_N(y)$ from the observation
vectors and adapts the non-speech acoustic model on the
basis of the extracted non-speech feature distributions and
the freshness function stored in the freshness function
storage unit 31.

Herein, the non-speech feature distributions $F_1(y)$,
$F_2(y)$, ..., $F_N(y)$ have discrete values observed in the
respective N frames during the noise observation interval Tn.
If the non-speech acoustic model correction unit 7 is
capable of dealing with discrete values, the non-speech
feature distributions $F_1(y), F_2(y), \ldots, F_N(y)$ having
discrete values can be directly used. However, in the case
where the non-speech acoustic model correction unit 7 is
designed to deal with a continuous value, it is required to
convert the non-speech feature distributions $F_1(y)$,
$F_2(y)$, ..., $F_N(y)$ having discrete values into continuous
values using a discrete-to-continuous converter so that the
non-speech acoustic model correction unit 7 can perform a
process correctly. The conversion of the discrete values
into a continuous value may be achieved, for example, by
performing approximation using a spline function.
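
One possible form of such a discrete-to-continuous conversion is a natural cubic spline through a per-frame parameter. The sketch below interpolates a single scalar per frame (for instance the mean of one component of each non-speech feature distribution) at arbitrary times between frames. The choice of a natural cubic spline, the unit frame spacing, and the function name are assumptions made for illustration; the text does not specify which spline is used.

```python
def natural_cubic_spline(values):
    """Return S(x) interpolating values[n] at x = n for n = 0 .. N-1.

    Frames are assumed to be equally spaced with unit spacing.
    """
    count = len(values)
    if count < 2:
        raise ValueError("at least two frames are required")
    # Second derivatives at the knots; natural end conditions fix both ends at 0.
    second = [0.0] * count
    inner = count - 2
    if inner > 0:
        rhs = [6.0 * (values[n - 1] - 2.0 * values[n] + values[n + 1])
               for n in range(1, count - 1)]
        # Thomas algorithm for the tridiagonal system with rows (1, 4, 1).
        cp = [0.0] * inner
        dp = [0.0] * inner
        cp[0] = 0.25
        dp[0] = rhs[0] / 4.0
        for k in range(1, inner):
            denom = 4.0 - cp[k - 1]
            cp[k] = 1.0 / denom
            dp[k] = (rhs[k] - dp[k - 1]) / denom
        second[inner] = dp[inner - 1]
        for k in range(inner - 2, -1, -1):
            second[k + 1] = dp[k] - cp[k] * second[k + 2]

    def evaluate(x):
        if x < 0.0 or x > count - 1:
            raise ValueError("x lies outside the observation interval")
        n = min(int(x), count - 2)
        t = x - n
        return (second[n] * (1.0 - t) ** 3 / 6.0
                + second[n + 1] * t ** 3 / 6.0
                + (values[n] - second[n] / 6.0) * (1.0 - t)
                + (values[n + 1] - second[n + 1] / 6.0) * t)

    return evaluate
```

The returned function reproduces the observed value at every frame index and gives a smooth estimate in between, which is the property the continuous-valued processing described above needs.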

Herein, the discrete values refer to a finite number of 
values observed at discrete times in an observation interval 
having a finite length, and the continuous values refer to 
an infinite number of values observed at arbitrary times in 
the observation interval with a finite (or infinite) length, 
wherein the continuous values are represented by a certain 
function. 

In the case where the non-speech feature distributions
used to adapt the non-speech acoustic model are given in the
form of discrete values, the freshness function (hereinafter
also referred to as the refresh function) is a function
of discrete values. However, in the case where the
non-speech feature distributions are given in the form of
continuous values, the refresh function is a function of
continuous values.

The refresh function and the adaptation of the non- 
speech acoustic model using the refresh function are 
described below. 

The refresh function F(x) may be defined, for example,
by equations (41) to (43).

$$ F(x) = 0 \quad \text{if } x \notin \Omega_{obs} \tag{41} $$

$$ F(x_2) \ge F(x_1) \quad \text{if } x_2 \ge x_1 \tag{42} $$

$$ \int_{\Omega_{obs}} F(x) \, dx = 1 \tag{43} $$

where $\Omega_{obs}$ denotes the observation interval of the non-speech
feature distributions. In the present embodiment, $\Omega_{obs}$
corresponds to the noise observation interval Tn.

According to equation (41), the refresh function F(x)
has a value of 0 for x outside the observation interval $\Omega_{obs}$.
According to equation (42), the refresh function F(x) has a
constant value or increases with a passage of time within
the observation interval $\Omega_{obs}$. This means that the refresh
function F(x) basically has a greater value for x closer to
the speech recognition interval (Fig. 2). Furthermore,
according to equation (43), when the refresh function F(x)
is integrated over the observation interval $\Omega_{obs}$, the result
must be equal to 1. Fig. 17 illustrates an example of the
refresh function F(x) which satisfies the conditions given
by equations (41) to (43).
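
As an illustration of conditions (41) to (43) for the discrete case, the following sketch builds a linearly increasing refresh function over N frames and normalizes it so that its values sum to 1, with the sum taking the place of the integral in equation (43). The linear shape and the function names are assumptions; the text only requires that the three conditions hold.

```python
def linear_freshness(num_frames):
    """Weights F(1), ..., F(N) that grow linearly and sum to 1."""
    if num_frames < 1:
        raise ValueError("at least one frame is required")
    total = num_frames * (num_frames + 1) / 2.0
    return [n / total for n in range(1, num_frames + 1)]


def freshness(weights, frame_index):
    """F(x) for a 1-based frame index x; 0 outside the observation interval."""
    if 1 <= frame_index <= len(weights):
        return weights[frame_index - 1]
    return 0.0
```

For four frames this gives weights proportional to 1, 2, 3, 4, so the frame nearest the speech recognition interval receives the largest weight.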

In the present embodiment, the refresh function F(x) is
used as a multiplier of the non-speech feature distributions,
as will be described later with reference to equation (44).
Therefore, when the refresh function F(x) has a positive or
negative value, the refresh function F(x) serves as a weight
applied to a non-speech feature distribution. When the
refresh function F(x) has a value equal to 0, the refresh
function F(x) makes a non-speech feature distribution
invalid when the non-speech feature distribution is
multiplied by the refresh function F(x) so as to have no
influence upon the adaptation of the non-speech acoustic
model.

The correction unit 32 shown in Fig. 15 determines the
adapted non-speech acoustic model $F_s(y)$ using the refresh
function F(x) described above and the non-speech feature
distributions $F_1(y), F_2(y), \ldots, F_N(y)$, basically in
accordance with equation (44).

$$ F_s(y) = V(F(1) F_1(y), F(2) F_2(y), \ldots, F(N) F_N(y)) \tag{44} $$

According to equation (44), the non-speech feature 
distributions are dealt with in the adaptation of the non- 
speech acoustic model such that a non-speech feature 
distribution closer to the speech recognition interval is 
weighted more heavily, thereby achieving a further 
improvement in the speech recognition accuracy. 
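
The effect of the weighting in equation (44) can be sketched with a 
simple stand-in for the correction function V. The example below 
assumes that each non-speech feature distribution is a 
one-dimensional Gaussian given by a mean and a variance, and it 
combines them by weighted moment matching. This choice of V, the 
restriction to non-negative weights, and the name adapt_noise_model 
are assumptions made for illustration only; they are not the 
correction function of the embodiment. 

```python
def adapt_noise_model(means, variances, weights):
    """Combine 1-D Gaussian noise distributions by weighted moment matching.

    Hypothetical stand-in for V in equation (44): weights[i] plays the
    role of F(i), and (means[i], variances[i]) describes the i-th
    non-speech feature distribution. Returns (mean, variance) of the
    adapted model.
    """
    if not (len(means) == len(variances) == len(weights)):
        raise ValueError("inputs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Weighted first moment.
    mean = sum(w * m for w, m in zip(weights, means)) / total
    # Weighted second moment; a weight of 0 removes that distribution.
    second = sum(w * (v + m * m)
                 for w, m, v in zip(weights, means, variances)) / total
    return mean, second - mean * mean
```

In this sketch, a distribution whose weight is 0 leaves the result 
unchanged, while a distribution with a larger weight pulls the 
adapted mean and variance toward its own values. 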

The speech recognition apparatus according to the 
present invention has been described above. Such a speech 
recognition apparatus may be used, for example, in a car 
navigation apparatus capable of accepting a command issued 
by voice, and also in various other types of apparatuses. 

In the above-described embodiments, a feature 
distribution parameter is determined taking into account the 
distribution characteristic of noise. The noise may include 
not only ambient noise in an environment where speech is 
made but also other noise such as that arising from the 
characteristic of a communication line such as a telephone 
line via which voice to be recognized is transmitted. 

The present invention can be applied not only to speech 
recognition but also to other kinds of pattern recognition, 
such as image recognition. 

Although in the above-described embodiments, the non- 
speech acoustic model is adapted using a non-speech feature 
distribution represented in a feature space, the non-speech 
acoustic model may also be adapted using a feature value of 
noise represented as a point in the feature space. 

Although in the above-described embodiments, the non- 
speech acoustic model representing noise is adapted, the 
adaptation method according to the present invention may 
also be used to adapt another acoustic model. 

The processing sequence described above may be executed 
by hardware or software. When the processes are performed 
by software, a software program is installed on a 
general-purpose computer or the like. 

Fig. 18 illustrates an embodiment of the invention in 
which a program used to execute the processes described 
above is installed on a computer. 

The program may be stored, in advance, on a hard disk 
105 serving as a storage medium or in a ROM 103, each of 
which is disposed inside the computer. 

Alternatively, the program may be stored (recorded) 
temporarily or permanently on a removable storage medium 
such as a floppy disk, a CD-ROM (Compact Disc Read Only 
Memory), an MO (Magneto-optical) disk, a DVD (Digital 
Versatile Disc), a magnetic disk, or a semiconductor memory. 
Such a removable recording medium 111 may be provided in the 
form of so-called package software. 

Instead of installing the program from the removable 
storage medium 111 onto the computer, the program may also 
be transferred to the computer from a download site via a 
digital broadcasting satellite by means of radio 
transmission or via a network such as a LAN (Local Area 
Network) or the Internet by means of wired communication. In 
this case, the computer receives, using a communication unit 
108, the program transmitted in such a manner and installs 
the program on the hard disk 105 disposed in the computer. 

The computer includes therein a CPU (Central Processing 
Unit) 102. When a user inputs a command by operating an 
input device 107 such as a keyboard or a mouse, the command 
is transferred to the CPU 102 via the input/output interface 
110. In accordance with the command, the CPU 102 executes a 
program stored in the ROM (Read Only Memory) 103. 
Alternatively, the CPU 102 may execute a program loaded in a 
RAM (Random Access Memory) 104 wherein the program may be 
loaded into the RAM 104 by transferring a program stored on 
the hard disk 105 into the RAM 104, or transferring a 
program which has been installed on the hard disk 105 after 
being received from a satellite or a network via the 
communication unit 108, or transferring a program which has 
been installed on the hard disk 105 after being read from a 
removable recording medium 111 loaded on a drive 109, 
whereby the CPU 102 executes the process represented by the 
above-described block diagram. The CPU 102 outputs the 
result of the process, as required, to an output device 106 
such as an LCD (Liquid Crystal Display) or a loudspeaker via 
the input/output interface 110. The result of the process 
may also be transmitted via the communication unit 108 or 
may be stored on the hard disk 105. 

In the present invention, the processing steps 
described in the program to be executed by a computer to 
perform various kinds of processing are not necessarily 
required to be executed in time sequence according to the 
order described in the flow chart. Instead, the processing 
steps may be performed in parallel or separately (by means 
of parallel processing or object processing). 

The program may be executed either by a single computer 
or by a plurality of computers in a distributed fashion. 
The program may be transferred to a computer at a remote 
location and may be executed thereby. 

In the model adaptation apparatus, the model adaptation 
method, the storage medium, and the pattern recognition 
apparatus according to the present invention, as described 
above, input data corresponding to a predetermined model 
and observed during a predetermined interval is extracted and 
output as extracted data. The predetermined model is 
adapted using the data extracted during the predetermined 
interval by means of one of the maximum likelihood method, the 
complex statistic method, and the minimum distance-maximum 
separation theorem. Pattern recognition can thus be performed 
using the adapted model, and high recognition performance can 
be achieved. 



